Biostatistics For Dummies (Monika Wahi John Pezzullo)

regression, except now, there are more equations to solve. In multiple regression, as in straight-line regression,

you can also get the information you need to estimate the standard errors (SEs) of the parameters.

Executing a Multiple Regression Analysis in Software

Before executing your multiple regression analysis, you may need to do some prep work on the

variables you intend to include in your model. In the following sections, we explain how to handle the

categorical variables you plan to include. We also show you how to examine these variables by

making several charts before you run your analysis. If you need guidance on which variables to consider

for your models, read Chapter 20.

Preparing categorical variables

The predictors in a multiple regression model can be either numerical or categorical (Chapter 8

discusses the different types of data). In a categorical variable, each category is called a level. If a

variable such as Setting can have only two levels, such as Inpatient and Outpatient, it’s called a

dichotomous or binary categorical variable. If it can have more than two levels, it’s called a

multilevel variable.

Figuring out the best way to introduce categorical predictors into a multiple regression model is

always challenging. You have to set up your data the right way, or you’ll get results that are either

wrong or difficult to interpret properly. Following are two important factors to consider.

Having enough participants in each level of each categorical variable

Before using a categorical variable in a multiple regression model, you should tabulate how many

participants (or rows) are included in each level. If you have any sparse levels (row frequencies in

the single digits), consider collapsing them into other levels. Usually, the more evenly

distributed the rows are across the levels, and the fewer levels there are, the more

precise and reliable the results. If a level doesn’t contain enough rows, the program may ignore that

level, halt with a warning message, produce incorrect results, or crash. Worse, if it does produce results,

they may be difficult or impossible to interpret.
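
If you want to see what this tabulation step might look like in code, here is a minimal Python sketch that uses only the standard library. It builds a one-way frequency table and flags sparse levels. The helper names (tabulate_levels, sparse_levels), the cutoff of 10 rows as the meaning of "single digits," and the way the data is laid out are all assumptions made for illustration, and the counts are the ones from the Primary Diagnosis example in this section.

```python
from collections import Counter

# Illustrative data: one primary diagnosis per participant (row).
# The counts match the Primary Diagnosis example in this section.
diagnoses = (
    ["Hypertension"] * 73 + ["Diabetes"] * 35 + ["Cancer"] * 1 + ["Other"] * 10
)

SPARSE_THRESHOLD = 10  # assumption: "single digits" means fewer than 10 rows


def tabulate_levels(values):
    """Return a one-way frequency table as {level: row count}."""
    return dict(Counter(values))


def sparse_levels(freq_table, threshold=SPARSE_THRESHOLD):
    """Return the levels whose row count falls below the threshold."""
    return sorted(level for level, n in freq_table.items() if n < threshold)


freq = tabulate_levels(diagnoses)

# Print the table from the most common level to the least common.
for level, n in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(f"{level:<14}{n:>4}")

print("Sparse levels:", sparse_levels(freq))
```

Most statistical packages have a built-in frequency procedure that does the same job, so treat this as a picture of the idea rather than a recommended workflow.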

Imagine that you create a one-way frequency table of a Primary Diagnosis variable from a sample of

study participant data. Your results are: Hypertension: 73, Diabetes: 35, Cancer: 1, and Other: 10. To

deal with the sparse Cancer level, you may want to create another variable in which Cancer is

collapsed together with Other (which would then have 11 rows). Another approach is to create a

binary variable with yes/no levels, such as: Hypertension: 73 and No Hypertension: 46. But a binary

variable doesn’t distinguish among the other levels. You could also make a binary Diabetes variable,

where 35 rows are coded as yes and the rest as no, and so on for Cancer and Other.
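
The two recoding options just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (collapse, indicator) are hypothetical, the Yes/No labels are one possible coding, and the data is simply the example counts expanded to one value per participant.

```python
from collections import Counter

# The example counts, expanded to one value per participant (row).
diagnoses = (
    ["Hypertension"] * 73 + ["Diabetes"] * 35 + ["Cancer"] * 1 + ["Other"] * 10
)


def collapse(values, mapping):
    """Recode levels using mapping; levels not in mapping stay unchanged."""
    return [mapping.get(v, v) for v in values]


def indicator(values, level):
    """Return a yes/no variable: 'Yes' where the value equals level."""
    return ["Yes" if v == level else "No" for v in values]


# Option 1: fold the sparse Cancer level into Other.
collapsed = collapse(diagnoses, {"Cancer": "Other"})
print(Counter(collapsed))

# Option 2: make one binary variable per diagnosis of interest.
hypertension = indicator(diagnoses, "Hypertension")
diabetes = indicator(diagnoses, "Diabetes")
print(Counter(hypertension))
print(Counter(diabetes))
```

Both options keep the original variable untouched and create new ones, so you can always go back to the uncollapsed levels if you change your mind.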

Similarly, if your model has two categorical variables with an interaction term (like Setting +

Primary Diagnosis + Setting * Primary Diagnosis), you should prepare a two-way cross-tabulation of

the two variables first (in our example, Setting by Primary Diagnosis). You need enough rows in each

cell of the table to run your analysis, which limits how finely you can divide your data.

See Chapter 12 for details about cross-tabulations.
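
As a rough sketch of that cross-tabulation check, the following Python code counts the rows in each Setting by Primary Diagnosis cell and lists the cells that look too sparse. The cell counts are invented for illustration (the text gives only the one-way diagnosis counts), Cancer is assumed to be already collapsed into Other, and the helper names and the cutoff of 10 rows are assumptions as well.

```python
from collections import Counter

# Hypothetical (setting, diagnosis) pairs, one per participant.
# These cell counts are invented for illustration; they are not from the text.
rows = (
    [("Inpatient", "Hypertension")] * 30
    + [("Outpatient", "Hypertension")] * 43
    + [("Inpatient", "Diabetes")] * 6
    + [("Outpatient", "Diabetes")] * 29
    + [("Inpatient", "Other")] * 11
)

MIN_CELL = 10  # assumption: treat single-digit cells as too sparse


def crosstab(pairs):
    """Return (row_levels, col_levels, cell counts) for a two-way table."""
    cells = Counter(pairs)
    row_levels = sorted({r for r, _ in pairs})
    col_levels = sorted({c for _, c in pairs})
    return row_levels, col_levels, cells


def sparse_cells(row_levels, col_levels, cells, min_cell=MIN_CELL):
    """List every cell, including empty ones, with fewer than min_cell rows."""
    return [
        (r, c, cells[(r, c)])
        for r in row_levels
        for c in col_levels
        if cells[(r, c)] < min_cell
    ]


settings, diagnoses, cells = crosstab(rows)

# Print the table with settings as rows and diagnoses as columns.
print(" " * 12 + "".join(f"{d:>14}" for d in diagnoses))
for s in settings:
    print(f"{s:<12}" + "".join(f"{cells[(s, d)]:>14}" for d in diagnoses))

print("Sparse cells:", sparse_cells(settings, diagnoses, cells))
```

Notice that the check loops over every combination of levels, not just the combinations that appear in the data, so a completely empty cell is reported too.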